- Note the Sitecore server and client hardware and software requirements at http://sdn.sitecore.net/SDN5/Products/Sitecore%20V5/Installation%20Guide/Requirements/Server%20Requirements.html and http://sdn.sitecore.net/SDN5/Products/Sitecore%20V5/Installation%20Guide/Requirements/Client%20Requirements.html.
- Upgrade to the latest release of Sitecore if possible.
- Check the installation and troubleshooting instructions for Sitecore at http://sdn.sitecore.net/Products/Sitecore%20V5/Installation%20Guide.html.
- Confirm the version of Sitecore in use is running under the appropriate version of the .NET Framework.
- Especially in development, a restart of ASP.NET, IIS, or even the machine may be required; logging out, closing all browser windows, and clearing the browser cache may also resolve certain issues with the Sitecore user interfaces.
- Always check for newly installed code.
- Check for other applications using resources.
- Confirm the w3wp.exe worker process is not consuming inordinate resources.
- Check any other .NET sites hosted on the same machine.
- Check any ISAPI filters or other processes running under IIS.
- Check antivirus recommendations.
- Check for database backups and other process interference.
- Check for any scheduled tasks.
- CPU utilization should correspond to the number of Sitecore requests processed; excess CPU consumption without user interactivity should only be caused by custom code or scheduled background tasks, such as publishing.
- Monitor whether publication occurs by adding an event handler to <event name="publish:begin"> in the web.config file (see the Using Events article for more details; a sketch follows this list).
- Compare /web.config to the default /web.config provided with the appropriate version of Sitecore. Attempt to diagnose by disabling or otherwise configuring any differences.
- Use profiling in the debugger to identify pages consuming inordinate resources.
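To make the publish:begin suggestion above concrete, here is a minimal sketch of such an event handler; the namespace, class, and assembly names are placeholders, and the handler signature and logging call should be verified against your Sitecore version:

// Registered in web.config (placeholder type/assembly names):
// <event name="publish:begin">
//   <handler type="MyCompany.Diagnostics.PublishMonitor, MyCompany.Diagnostics" method="OnPublishBegin"/>
// </event>
using System;

namespace MyCompany.Diagnostics
{
   public class PublishMonitor
   {
      // Sitecore invokes configured event handlers with an (object, EventArgs) signature.
      public void OnPublishBegin(object sender, EventArgs args)
      {
         // Leave a marker in the log so publications can be correlated with
         // HTML cache clears and performance dips.
         Sitecore.Diagnostics.Log.Info("publish:begin fired at " + DateTime.Now, this);
      }
   }
}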
1. Identifying Slow or Leaking Pages
This section describes how to use the global.asax file to identify requested pages that exceed certain thresholds. Here we concentrate on:
- Pages taking more than X milliseconds to execute.
- Pages leaking more than X Kb of data.
The former is useful if your website has an unusually high CPU load and/or database access, but initial profiling could not reveal any misbehavior on the most obvious pages. By tracing all pages that take more than a specified time to render, we get a good starting point for further investigation. For instance, the culprit could be a sitemap page that uses a sitemap.xslt with a "//" expression in its XPath.
The latter is useful if you experience memory leaks and are looking for individual pages that could be leaking memory.
In both cases, dtSearch or a similar crawling tool can be used to trigger requests to all pages.
Getting Started
- Rename your original global.asax in the root of your site.
- Unpack v5global.asax from the provided zip file and rename it to global.asax.
- Set both of the booleans mentioned below to false in global.asax:
bool logmemory = false;
bool logtime = false;
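Before enabling either threshold, it may help to see the shape of the logic involved. The following is a minimal sketch of the kind of code v5global.asax contains, not the actual file; member names, the cache-flush strategy, and the Sitecore.Diagnostics.Log call are assumptions:

<script runat="server">
bool logmemory = false;
bool logtime = false;
int timeThreshold = 1000;   // ms
int memoryThreshold = 100;  // Kb

void Application_BeginRequest(object sender, EventArgs e)
{
    if (logtime)
    {
        Context.Items["startTicks"] = DateTime.Now.Ticks;
    }
    if (logmemory)
    {
        // Flush the ASP.NET cache so cached Sitecore items are not mistaken
        // for leaks; comment this block out to trade precision for speed.
        System.Collections.ArrayList keys = new System.Collections.ArrayList();
        foreach (System.Collections.DictionaryEntry entry in Context.Cache)
        {
            keys.Add(entry.Key);
        }
        foreach (string key in keys)
        {
            Context.Cache.Remove(key);
        }
        Context.Items["startMemory"] = GC.GetTotalMemory(true);
    }
}

void Application_EndRequest(object sender, EventArgs e)
{
    if (logtime && Context.Items["startTicks"] != null)
    {
        double ms = TimeSpan.FromTicks(DateTime.Now.Ticks - (long)Context.Items["startTicks"]).TotalMilliseconds;
        if (ms > timeThreshold)
        {
            Sitecore.Diagnostics.Log.Info("Page exceeded time threshold (ms): " + ms + " ( Url: " + Request.Path + ")", this);
        }
    }
    if (logmemory && Context.Items["startMemory"] != null)
    {
        long kb = (GC.GetTotalMemory(true) - (long)Context.Items["startMemory"]) / 1024;
        if (kb > memoryThreshold)
        {
            Sitecore.Diagnostics.Log.Info("Page exceeded memory threshold (Kb): " + kb + " ( Url: " + Request.Path + ")", this);
        }
    }
}
</script>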
The Time Threshold
Set logtime = true, and
int timeThreshold = <Your_Preferred_Threshold in Milliseconds>; // ms
Save and close.
You should now start getting entries in the log file like:
3064 14:47:53 INFO Page exceeded time threshold (ms): 1172,1225 ( Url: /sitecore/content/home/default)
The Memory Threshold
Set logmemory = true, and
int memoryThreshold = <Your_Preferred_Threshold in Kb>; // Kb
Save and close.
You should now start getting entries in the log file like:
3064 14:47:53 INFO Page exceeded memory threshold (Kb): 172 ( Url: /sitecore/content/home/default)
Important note: in this mode, the cache is flushed on each request to ensure that only true memory leaks (see the note below) are logged. However, this has a considerably negative effect on your website's performance.
Note: Sitecore uses the ASP.NET cache to store Sitecore items, which causes an increase in memory consumption. This is not a true memory leak, however, as the memory is released when the item is eventually removed from the cache.
You can comment out the lines that flush the cache to improve performance, but be aware that the logged figures will then be less precise.
If a leaking page is found, repeatedly refreshing this page should now lead to a noticeable increase in memory usage.
Download the source code for this article.
2. Scheduled Publications
The frequency of scheduled publication can be adjusted, but note that publishing clears the HTML cache and can therefore interfere with performance. Review http://sdn.sitecore.net/SDN5/FAQ/Administration/Scheduled%20tasks.html and adjust the Sitecore.Tasks.PublishAgent, which is disabled by default (commented out) in /web.config.
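For reference, enabling the agent amounts to uncommenting (and adjusting the interval of) an entry of roughly this shape in the scheduling section of /web.config; take the exact attributes and parameters from your own default web.config rather than from this sketch:

<agent type="Sitecore.Tasks.PublishAgent" method="Run" interval="01:00:00">
   <!-- parameters such as source and target database, publishing mode
        and languages appear here in the default web.config -->
</agent>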
3. Digging for gold in IIS logfiles
This section describes how to use Microsoft's free tool "Log Parser 2.2" with IIS logfiles.
Have you ever experienced CPU usage skyrocketing on your server, or at least somehow experienced long response times from Sitecore? Then you have likely ended up with a question like this:
I wonder what pages on my site (be it on the website or in the client) are consuming all the resources?
In the section above you can read how global.asax can be used to log all pages taking longer than X milliseconds. However, it doesn't give you the full picture, and you have to wait until enough sample data has been collected. You don't want to wait, you want answers NOW? Well, if the column "Time taken" is added to your IIS logfiles, you can get these answers. If not, hurry up and add it.
So you open your IIS logfile in Excel and do some sorting, filtering, and grouping to show these data, right? Yes, if you have fewer than 65,000 entries and enough patience, maybe you do, but it doesn't exactly work like a dream. If you are (hopefully) familiar with SQL, you are likely to whine something like: "Sigh, why can't I just do some SUMs, COUNTs and AVERAGEs in SQL instead of this Excel nightmare?" Now there is a solution. Try e.g.:
- Download Log Parser 2.2 from http://www.microsoft.com/downloads/details.html?FamilyID=890cd06b-abf8-4c25-91b2-f8d975cf8c07&displaylang=en
- Install it on your web server.
- From the log folder of your website, run the following command:
logparser "SELECT TOP 10 COUNT(*), AVG(time-taken), SUM(time-taken), cs-uri-stem INTO report_sum.txt FROM test.log WHERE EXTRACT_EXTENSION(cs-uri-stem)='aspx' GROUP BY cs-uri-stem ORDER BY SUM(time-taken) DESC"
- Now view the top 10 URLs that spent the most time being rendered in the output file report_sum.txt.
It works like a charm, and fast. Opening report_sum.txt afterwards, hey presto: a list as promised, showing that Default.html (the front page of the website) topped the list with 1,100+ seconds in total. It also tells me that the mailing list page (/Aktuelt/Postlister.html) averages a high 3 seconds per request.
COUNT(ALL *) AVG(ALL time-taken) SUM(ALL time-taken) cs-uri-stem
------------ ------------------- ------------------- ----------------
976 1139 1111939 /Default.html
100 2959 295996 /Aktuelt/Postlister.html
146 989 144425 /Job.html
20 4663 93269 /sitecore/.../Upload/Upload2.html
...
I must admit this is nothing less than a super cool tool that allows you to easily identify problem pages in your installation. The ways these SQL statements can be combined are endless, and it can even be used with Sitecore 5 logfiles.
Another useful query I would like to show is the following, which gives you statistics per hour from your logfile.
"c:\program files\log parser 2.2\logparser" "SELECT date, QUANTIZE(time,3600) as interval, COUNT(*), AVG(time-taken), SUM(time-taken), DIV(TO_REAL(COUNT(*)), 3600) AS Requests/sec INTO 'report_total_load_per_minute.txt' FROM %LOGFILE% GROUP BY date, interval ORDER BY date, interval" -o:NAT
As can be seen below, this query will, for example, tell you that there were 111K+ requests between 12 p.m. and 1 p.m., that the average time taken was 144 ms, and that there were a whopping 31 requests per second during this period. The total time taken is 16,087 seconds, or approx. 4,470 ms per wall-clock second. Now, there are only 1,000 ms in a real second, but as these data were taken from a four-processor server, you have 4 * 1,000 = 4,000 CPU milliseconds available per second.
date interval COUNT(ALL *) AVG(ALL time-taken) SUM(ALL time-taken) Requests/sec
---------- -------- ------------ ------------------- ------------------- ------------
2006-07-03 12:00:00 111391 144 16086950 30.941944
2006-07-03 13:00:00 133281 106 14210042 37.022500
2006-07-03 14:00:00 50872 150 7678055 14.131111
2006-07-03 15:00:00 29795 198 5921612 8.276389
2006-07-03 16:00:00 10403 435 4526177 2.889722
I would like to introduce a key value called "CPU-Utilization", measured the following way:
CPU-Utilization = 100% * Avg(time-taken) * (requests per second) / (#CPUs * 1000)
From the numbers above, the CPU-Utilization is calculated like this: 100% * (144 ms * 31 req/sec) / (4 * 1000) = 112%.
What does this number tell you? Well, you would not expect to utilize your server above 100%, and a number above this can only mean that requests have been queued or idle within the server before a response was delivered. I would say that any CPU-Utilization exceeding 80% is not sound over longer periods of time, and in such cases you should seriously consider whether the workload can be decreased or hardware should be added to the installation. The limit of 80% is taken out of the blue and may well be set too high. At a CPU-Utilization of 112%, something is terribly wrong and your users are probably suffering badly.
Summary:
This super cool tool allows us to easily determine the historical workload of the server as well as pinpoint URLs that are very resource-demanding. Standard .bat files can be created that produce detailed reports from one or more logfiles in seconds (a sketch follows below). If your customer or IT department insists that their single-CPU server is more than enough hardware to do the job, this may be just the evidence needed to convince them otherwise.
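As an illustration, here is a minimal .bat sketch built from the first query shown above; the Log Parser path and the filenames are assumptions to adapt to your own setup:

@echo off
rem Usage: report.bat <iis_logfile>
rem Produces report_sum.txt with the 10 most time-consuming .aspx urls.
"c:\program files\log parser 2.2\logparser" "SELECT TOP 10 COUNT(*), AVG(time-taken), SUM(time-taken), cs-uri-stem INTO report_sum.txt FROM %1 WHERE EXTRACT_EXTENSION(cs-uri-stem)='aspx' GROUP BY cs-uri-stem ORDER BY SUM(time-taken) DESC" -o:NAT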